Generalizability and effect measure modification in sibling comparison studies

Sibling comparison studies have the attractive feature of being able to control for unmeasured confounding by factors that are shared within families. However, there is sometimes a concern that these studies may have poor generalizability (external validity) due to the implicit restriction to families that are covariate-discordant, i.e., those families where at least two siblings have different levels of at least one of the covariates (exposure or confounders) under investigation. Even if this selection mechanism has been noted by many authors, previous accounts of the problem tend to be brief. The purpose of this paper is to provide a formal discussion of the implicit restriction to covariate-discordant families in sibling comparison studies. We discuss when and how this restriction may impair the generalizability of the study, and we show that a similar generalizability problem may in fact arise even when all families are covariate-discordant, e.g. even if the exposure is continuous so that all siblings have different exposure levels. We show how this problem can be solved by using a so-called marginal between-within model for estimation of marginal exposure effects. Finally, we illustrate the theoretical conclusions with a simulation study.


Introduction
Unmeasured confounding is a serious threat to the validity of observational studies. While measured confounders can be accounted for using an array of statistical techniques, the sibling comparison design is capable of also reducing confounding bias from unmeasured and even unknown confounding factors. By comparing differentially exposed siblings within families, rather than between unrelated subjects, the study implicitly controls for all measured and unmeasured factors that siblings from the same family have in common [1,2]. These factors may, for instance, include early childhood environment and upbringing, and stable parental characteristics such as socioeconomic status and education.
A prominent special case is the co-twin control study, which, if restricted to monozygotic twins, completely controls for all heritable genetic factors.
Despite the strong appeal of sibling comparison studies, there is sometimes a concern that these studies may have poor generalizability (external validity) due to various selection mechanisms. One obvious such selection mechanism is the explicit restriction to twins in a co-twin control study; if twins are different from ordinary siblings in important aspects, the exposure effect estimated in twins may not generalize well. Another selection mechanism applies to all sibling comparison studies, through the implicit restriction to families with more than one child. As the aim is to compare siblings within families, only families with more than one child can contribute with information to the study.
A more subtle selection mechanism, which also applies to all sibling comparison studies, is an implicit restriction to families that are covariate-discordant. Here, we use the term 'covariate' for both the exposure of interest and for other measured variables (e.g. confounders) that the researcher controls for in the analysis. That a family 1 3 is 'covariate-discordant' means that there are at least two siblings in the family with different levels on at least one of the covariates; this is a necessary condition for the family to be informative about covariate-outcome associations within (conditional on) the family. Conversely, a family is covariate-concordant, and thus non-informative, if all siblings in the family have equal levels on all covariates. When there are no other covariates in the study than the exposure of interest, only exposure-discordant families are informative about the exposure-outcome association within the family. With more covariates in the study the situation is more complex, since a family that is exposure-concordant may be indirectly informative about the within-family exposure-outcome association, to some extent. This happens if the family is discordant on, and thus informative about, at least one of these other covariates, since the within-family associations of all covariates are entangled with each other (technically, they are not likelihood orthogonal). For studies analyzed with conditional logistic regression only families that are 'doubly discordant' (at least two siblings in the family with different covariate and outcome levels) are informative [3].
Even if the aforementioned selection mechanisms have been noted by many authors several issues remain unclear, in particular related to the restriction to covariate-discordant families. First, previous accounts of the problem tend to be brief and rarely go beyond stating the obvious (i.e., that selection may reduce generalizability), without commenting on how or to what extent this would influence the particular situation [1,[5][6]23]. Second, it has been claimed that the problem is moot when the exposure of interest is continuous; for instance, Hutcheon and Harper [7] write that a continuous exposure helps to 'maintain the generalizability of our findings'. On a superficial level this claim may seem reasonable, since essentially all families with more than one child are exposure/covariate-discordant and thus contribute with information to the study, if the exposure is truly continuous and measured with high accuracy. However, some siblings may still have very similar exposure levels, and are thus close to concordant. One could imagine a gradient in the amount of information provided, where 'less discordant' siblings provide less information and the dichotomization into 'discordant' and 'concordant' are just extreme ends of a continuous spectrum. To our knowledge this has not been investigated and it is unclear if and how such an 'information gradient' would affect the generalizability of the study.
The purpose of this paper is to provide a formal discussion of the implicit restriction to covariate-discordant families in sibling comparison studies. We focus on this type of selection mechanism since it is arguably the most subtle and least understood among those mentioned above. The paper is organized as follows: We first establish the notation, definition and assumptions that we will use throughout the paper. We next discuss when and how the restriction to covariate-discordant families may impair the generalizability of the study, and we show that a similar problem may indeed arise even when the exposure is continuous and all families are covariate-discordant. After that we show how this problem can be solved by using a marginal betweenwithin model for estimation of marginal exposure effects, and we illustrate the theoretical results with a simulation study. Finally, we discuss how the presence and magnitude of non-generalizability can be assessed in practice.

Notation, definitions and assumptions
Let X ij and Y ij be the exposure and outcome of interest, respectively, for sibling j in family i, for j = 1, … , n i and i = 1, … , n . Let U i be the full set of confounders, measured or unmeasured, that are shared (i.e., constant) in family i. The purpose of sibling comparison studies is to implicitly control for U i by comparing differentially exposed siblings within the same family. Let C ij be the set of measured nonshared confounders for sibling j in family i. To keep notation simple we treat C ij as a scalar but in practice it may often be a vector of variables. Finally, let S i be the indicator of family i being selected into (e.g. being informative for) the study; S i = 1 for 'selected' (informative) and S i = 0 for 'not selected' (non-informative). For any variable V ij we define the vector i = (V i1 , … , V in i ) , the family mean V i = ∑ n i j=1 V ij ∕n i a n d t h e f a m i l y v a r i a n c e [8] in Fig. 1 illustrates the assumed relations between U i , i , i , i and S i , for a family with two siblings, with obvious generalization to larger families. The dashed double-headed arrows between U i and (C i1 , C i2 ) indicate that U i may affect (C i1 , C i2 ) , but also the other way around. The square box around S i = 1 indicates the implicit conditioning by selection into the study. The arrows from Fig. 1 Causal diagram illustrating a sibling comparison study, with restriction to covariate-discordant families (X i1 , X i2 ) and (C i1 , C i2 ) to S i represent the restriction to covariate-discordance; a family with size n i is only selected into (informative for) the study if (X i1 , … , X in i ) are not all equal or (C i1 , … , C in i ) are not all equal. As noted in the Introduction we will focus on this restriction and ignore other possible selection mechanisms. For instance, we ignore the possibility that the restriction to families with more than one child may introduce a direct effect of U i on S i , since the factors that determine whether the parents strive to have more than one child (e.g. socio-economic status) may also confound the exposure and the outcome. We also noted in the Introduction that, in the special case when the outcome is binary and analyzed with conditional logistic regression, only doubly discordant families contribute to the analysis. Hence, in this special case S i is also directly affected by Y i1 , … , Y in i .
The causal diagram in Fig. 1 encodes two important assumptions; no unmeasured non-shared confounding and no carryover effects, i.e., no effect of the exposure and/or outcome of a sibling on the exposure and/or outcome of other siblings. In practice, both assumptions may often be violated to some extent. However, to keep the discussion focused on selection mechanisms and generalizability issues we assume that both assumptions hold and refer to Frisell et al. [9] and Sjölander et al. [10] for further discussions of unmeasured non-shared confounding and carryover effects, respectively.
When analyzing sibling data it is common to assume a fixed effects model on the form where E(Y ij |U i , X ij , C ij ) is the conditional mean of the outcome, given (U i , X ij , C ij ) and g is an appropriate link function, typically the identity link, the log link or the logit link. Using the logit link leads to what is often referred to as 'conditional logistic regression'. The term 'fixed' refers to the intercept 0i , which is a categorical parameter with a fixed level per family. This intercept is intended to absorb, and thereby control for, the shared confounders U i . The parameter X measures the conditional association between X ij and Y ij , given U i and C ij . In the absence of unmeasured non-shared confounders, X has a causal interpretation as the conditional causal effect of X ij on Y ij , given (U i , C ij ) . This parameter is usually estimated with conditional maximum likelihood; we refer to the Appendix for details.

Covariate-discordance and effect measure modification by shared confounders
To see intuitively how the restriction to covariate-discordant families may cause generalizability problems, note that the shared confounders are statistically associated with the selection into the study since U i has an indirect effect on S i in Fig. 1 mediated through (X i1 , … , X in i ) , and possibly also through (C i1 , … , C in i ) . As a result, those families that are selected into the study will generally have a different distribution of the shared confounders than those not selected into the study. If the shared confounders also modify the effect of the exposure on the outcome, then the observed effect among the covariate-discordant families will not generally be the same as the effect among the covariate-concordant families, and will thus not generalize to the whole population of both covariate-discordant and covariate-concordant families.
To quantify the problem, and to show that a similar problem may arise even for continuous exposures, we consider the fixed effects model (1). For pedagogical purposes we temporarily ignore the measured non-shared confounders C ij , so that covariate-discordance/concordance is equivalent with exposure-discordance/concordance, and the assumed model is given by It is easy to show (see the Appendix) that the fixed effects model estimate ̂ X only draws information from the exposure-discordant families, as expected by intuition. However, it is important to note that this is a feature of the estimation process, not of the model per se, which makes no reference to exposure-discordance/concordance. The model conditions on U i , not on exposure-discordance, and the within-family effect X is assumed to be constant across all levels of U i , i.e., it is assumed that there is no effect measure modification by the shared confounders. Hence, if the model is correct the restriction to exposure-discordant families in the estimation process does not cause any generalizability problems.
To see how generalizability problems may nevertheless arise, suppose that model (2) is not correct, since there is in fact effect measure modification by the shared confounders. Thus, the true model is given by where the parameter X is a function of U i . Suppose further that g is the identity link. It can then be shown (see the Appendix) that, regardless of whether X ij is binary or continuous, the estimate ̂ X under the incorrectly assumed model (2) converges to a weighted average of the U i -specific effects: The weights are directly proportional to the conditional variance of X ij , given U i : The expression in (4) shows that all families do not contribute equally to the estimate of X ; the families that have levels of U i associated with a high variability in X ij tend to be more informative. Hence, in the presence of effect measure modification by U i , ̂ X does not converge to the marginal (average) exposure-outcome effect E{ X (U i )} across all families, and does not estimate a parameter that is generally representative of the whole population.
An important special case is when the terms (n i − 1)Var(X ij |U i ) and X (U i ) are independent. This happens, for instance, when n i and Var(X ij |U i ) are independent of U i , e.g. the shared confounders have no influence on family size or the variability in the exposure. In this special case, it follows from (4) that ̂ X converges to E{w(U i )}E{ X (U i )} , which is equal to E{ X (U i )} since E{w(U i )} = 1 . Wooldridge [11] (Section 11.7.3) provided a similar results, but without providing the analytic expression for the large sample limit of ̂ X in (4). The problem of non-generalizability is most accentuated when some families are exposure-concordant, since the estimate ̂ X gives 0 weight to these families (or, more precisely, it gives 0 weight to the exposure effect X (U i ) at values of U i for which Var(X ij |U i ) = 0 ). However, as suggested in the Introduction, this is just an extreme end of a continuous spectrum; a similar problem may arise even if the exposure is continuous and all families are exposure-discordant, as the following example illustrates.
Suppose that U i is a single variable having a standard (mean 0, standard deviation 1) normal distribution, and that X (U i ) = U i . For positive values of U i we have that Since U i is distributed symmetrically around 0 these deviations from 0 cancel out, so that the marginal exposure effect is 0. However, if n i is independent of U i we have from (4) that which is not generally equal to 0. For instance, suppose that is the standard normal CDF. Then, it can be shown numerically that ̂ X does not converge to 0, but to 0.4. We return to this example in the 'Simulation' section.
The asymptotic limit of ̂ X is most straight forward and intuitive for the simple scenario above, where the fixed effects model has an identity link and includes no measured non-shared confounders. In the Appendix we derive the asymptotic limit of ̂ X for three other scenarios: the fixed effects model has an identity link and includes one measured non-shared confounder, and the fixed effects model has a log link or a logit link and includes no measured non-shared confounders; for the latter two scenarios we also assume that X ij is binary and n i = 2 . We show that the asymptotic limit of ̂ X is a weighted average of the U i -specific effects for all these scenarios. Although the weights have more complex expressions than for the simple scenario above, a common feature is that the weights increase with Var(X ij |U i ) (all other factors constant), and are equal to 0 for values of U i such that Var(X ij |U i ) = 0.

Estimation of marginal effects with marginal between-within models
The generalizability problem discussed in the previous section arises because we use a fixed effects model that assumes no effect measure modification by the shared confounders, when in fact such effect measure modification is present. Although effect measure modification can in principle be modeled by including an interaction (product) term between the exposure and the shared confounders, such interaction is not estimable with conditional maximum likelihood [12]. If the shared confounders were measured, then one could solve the problem by stratifying on the confounders and estimating the conditional effect at each level of these. In practice though, this is typically not possible since the shared confounders are unmeasured to a large extent; indeed, this is the motivation for pursuing a sibling comparison study.
A more feasible way to avoid bias due to effect measure modification is to use a so-called between-within (BW) model. This model comes in two versions; as a marginal [13,14] or a conditional model [15][16][17]. Like the fixed effects model, the conditional BW model conditions on the shared confounders, and assumes no effect measure modification by these. In contrast, the marginal BW model marginalizes over the shared confounders, and is thus ignorant about the presence or absence of effect measure modification.
The central idea in (both marginal and conditional) BW models is to assume that the shared confounders U i can be represented by a measurable 'proxy variable' D i , which is a function of the vectors ( i , i ) . Technically, i and i are assumed to be conditionally independent of U i , given D i : we refer to assumption (5) as the 'BW assumption'. The proxy variable D i is often taken to be the vector of means (X i , C i ) ; however, to make the BW assumption more realistic we may also include, for instance, the variances (X i , C i ) [17]. In some studies the family size n i is likely related to the shared confounders U i . This may for instance be the case in developing countries, where the number of children is often strongly related to familial socio-economic status. In such situations one may include n i in D i to make the BW assumption more realistic. In some special cases the BW assumption is guaranteed to hold. In particular, in the absence of measured non-shared confounders the BW assumption holds with D i = (X i , X i , n i ) when X ij has a normal distribution, and with D i = (X i , n i ) when X ij has a Bernoulli distribution (e.g. when X ij is binary), regardless of how U i is distributed (see the Appendix).
Under the BW assumption we can estimate the marginal effect of taking the exposure from, say, x = 0 to x = 1 in the target population with a marginal BW model, using a fivestep procedure. We give a brief explanation of the procedure here, and refer to Sjölander [14] for a more rigorous description. In the first step we fit a marginal BW model on the form where h is a regression function and is a vector of model parameters. A standard linear BW model assumes that In this model, the coefficients for X i and X ij are referred to as the 'between-' and 'within-effect' of the exposure, respectively; thus the term 'BW model'. In the second step we manipulate the observed data by replacing each subject's factual exposure level with the fixed level 0, but without changing the factual levels of C ij and D i . In the third step we use the fitted model to predict the outcome for each subject in the manipulated data, i.e., for each observed level of C ij and D i . For a given subject, this prediction is an estimate of the counterfactual outcome, had the exposure been set to 0 for that subject while holding the values of C ij and D i constant. In the fourth step we average these predictions to obtain an estimate of the mean outcome, had the exposure been set to 0 for all subjects. Finally, in the fifth step we repeat the procedure for x = 1 , and contrast the estimated counterfactual means for x = 0 and x = 1 to obtain an estimate of the marginal exposure effect. This estimation procedure is a form of regression standardization [18], and can easily be carried out with, for instance, the package stdReg in R [19,20]. Asymptotic confidence interval and p-values for the estimates can be computed with standard theory for estimating equations as described by Sjölander [14], and can also be obtained from the stdReg package.
We emphasize three important points regarding the model and estimation procedure outlined above. First, although the marginal BW model does not explicitly include (condition on) the shared confounders, it implicitly controls for these through the proxy variable D i . Second, in contrast to the fixed effects model (1), the marginal BW model (6) makes no assumption about the presence or absence of effect measure modification by the shared confounders. Thus, the estimated marginal exposure effect is (asymptotically) unbiased also when such effect measure modification is present, provided that the marginal BW model (6) is correctly specified and the BW assumption (5) holds. Third, standard regression models are often kept simple to make the results transparent and interpretable. However, this is not necessary for the marginal BW model, since the fitting of this model is just the first step in the five-step estimation procedure, and the end product of the procedure (the counterfactual mean outcome) will be no less interpretable if the underlying model is complex than if the model is simple. Hence, to ensure that the model is realistic one may use a relatively complex model specification, including, for instance, splines and interaction terms. We illustrate these points in the next section with a simulation.

Simulation
In this section we present the results from a simulation study, demonstrating the conclusions and methods from previous sections. We simulated samples of families with 1, 2, 3 or 4 siblings using probabilities 0.2, 0.5, 0.2, and 0.1 to match the distribution of siblings in Sweden [21]. We ignored nonshared confounders C ij and generated the shared confounders, the exposure and the outcome from the model We generated both a continuous outcome from a normal distribution with unit variance, for which g was the identity link, and a binary outcome, for which g was the logit link. The parameter X (U i ) determines the degree of effect measure modification by U i . We considered three scenarios; X (U i ) = 0 (no effect measure modification), X (U i ) = U i (positive effect measure modification) and X (U i ) = −U i (negative effect measure modification). In the first scenario the exposure effect is 0, both conditionally and marginally over U i . In the other two scenarios the conditional exposure effect depends on U i , but the marginal exposure effect is still 0 since positive and negative conditional effects cancel out.
It can be shown (see the Appendix) that under the conditional (on U i ) model in (7) the BW assumption (5) holds with D i = (X i , X i ) . However, the true regression function h(D i , X ij ; ) in (6) is rather non-standard and unintuitive. We emphasize that this complexity is a consequence of using a simple conditional model. Alternatively, we could have started by formulating a simple marginal model, which would typically correspond to a complex conditional model. In real scenarios there is no a priori reason why either model would be more plausible than the other.
We generated 1000 samples of n = 1000 families each from the model in (7). For each sample we estimated the conditional exposure effect X in the fixed effects model (2), using the gee function in the drgee package in R. This model is correct when there is no effect measure modification ( X (U i ) = 0 ) but otherwise incorrect. We also estimated the marginal effect of increasing the exposure from x = 0 to x = 1 with two different marginal BW models, fitted with the stdGlm function in the stdReg package. For this marginal effect we used the same link function as in the fixed effects model, i.e. an identity link for the continuous outcome and a logit link for the binary outcome. The first BW model was the standard model This model incorrectly assumes that the BW assumption (5) holds with D i = X i , and that the relation between Y ij and (D i , X ij ) is linear, on the scale defined by the link function g. The second BW model was defined as where s(X i ) and s(X i ) are natural cubic spline functions of X i and X i , with knots at the three quartiles in their sample distributions. This model correctly assumes that the BW assumption (5) holds with D i = (X i , X i ) , but uses spline functions to approximate the correct regression function h(D i , X ij ; ) . For the continuous outcome (i.e. when g was the identity link) we additionally fitted the correct marginal BW model, as derived in the Appendix. We did not attempt to fit this model for the binary outcome, since the logit link function makes the model very complex and computationally demanding.
For each fitted model we computed the mean and standard deviation of the estimates over the 1000 samples, together with the mean standard error as obtained from the gee and stdGlm functions, and the empirical coverage probability of the 95% confidence interval estimate±1.96×standard error. R code for the simulation is given in the Appendix. Table 1 shows the result. In the absence of effect measure modification by U i , all estimates are virtually unbiased (mean estimates equal to 0). However, in the presence of negative and positive effect measure modification by U i , the mean estimates are equal to −0.4 and 0.4 for the linear fixed effects model, and −0.35 and 0.07 for the logistic fixed effects model, thus biased to various degree. The linear standard BW model has the same degree of bias as the linear fixed effects model for these scenarios, whereas the logistic standard BW model has considerably less bias (mean estimates −0.07 and 0.02) than the logistic fixed effects model. The linear spline BW model is virtually unbiased (mean estimates 0.01 and −0.01 ), whereas there appears to be a slight bias in the logistic spline BW model (mean estimates −0.04 and −0.03 ). Finally, the correct linear BW model appears to be virtually unbiased (mean estimates 0.01 and −0.01 ). All mean standard errors agree well with the corresponding standard deviations of the estimates, even when the estimates are badly biased. This is expected, given that the gee and stdGlm functions provide robust 'sandwich' standard errors, which do not rely on the specified model  [22]. However, the 95% confidence intervals only have close to nominal 95% coverage for those scenarios where the estimate is almost unbiased. This is also not surprising, since any asymptotic bias in the estimate, even if minuscule, forces the coverage probability of the confidence interval estimate±1.96×standard error to 0 as the sample size goes to infinity. Notably, the standard deviation of the estimates from the linear spline BW model are no larger than the standard deviation of the corresponding estimates from the correct linear BW model. This indicates that no statistical efficiency was lost by approximating the correct model with splines. The standard deviation of the estimates from the logistic BW models are considerably smaller than the standard deviation of the corresponding estimates from the logistic fixed effects model. This indicates that, for the logistic link function, the BW model does not only facilitate estimation of marginal effects, but also gives higher statistical efficiency.
In practice, it is unlikely that the analyst would be able to correctly specify a non-standard and unintuitive model such as the true regression function h(D i , X ij ; ) in our simulation. In contrast, even though the spline model in (8) has a rather complex mathematical expression, it only contains standard spline functions that are available in all major statistical software. It is thus reassuring that the spline BW models gave close to unbiased estimates in our simulation, since it indicates that the analyst may not need to specify the true model correctly; using standard spline functions as an approximation may suffice.

Assessing the presence and magnitude of non-generalizability
Even though the issue of generalizability in sibling comparison studies has received little formal attention, several authors have proposed informal methods to assess the magnitude of the problem. D'Onofrio et al [23] and Class et al [24] used sibling comparison designs to study the effects of preterm birth and fetal growth restrictions on mortality and psychiatric morbidity. As a sensitivity analysis they estimated the exposure-outcome associations using ordinary (i.e. not fixed effects) regression models, fitted separately to families with two or more children (informative for their sibling comparisons) and to families with only one child (non-informative). They obtained similar estimates in the two groups, which they took as evidence for generalizability of sibling comparison estimates from the former group to the latter. Gebremedhin et al [25] used a sibling comparison design to study the effect of interpregnancy interval on hypertensive disorder. As a sensitivity analysis they compared the distribution of measured covariates (e.g. maternal age at first birth, marital status, ethnicity) between families with three or more children (informative for their sibling comparisons) and families with one or two siblings (non-informative). Like D'Onofrio et al [23] and Class et al [24] they observed no major differences between the groups, which they took as evidence for generalizability.
Although such sensitivity analyses may be informative to some extent, they do not provide definitive evidence. It is possible that both the marginal (over the shared confounders) exposure-outcome association and the distribution of measured covariates are similar between the informative and non-informative families, yet the conditional (on the shared confounders) exposure-association is different, and vice versa. Furthermore, such sensitivity analyses are only useful to the extent that there is a clear-cut between the informative and non-informative families. Families with only one child are clearly non-informative in these studies, as well as families with two children in the study by Gebremedhin et al. However, we have shown that for continuous exposures such as interpregnancy interval there is also a gradient in the information provided, where 'less discordant' siblings provide less information. This feature is not addressed in the sensitivity analyses by D'Onofrio et al [23], Class et al [24] and Gebremedhin et al [25] .
An alternative way of assessing the potential for non-generalizability is to consider the underlying mechanisms of the problem. We have shown that non-generalizability arises in sibling comparison studies because of effect measure modification by the shared familial confounders, and we have argued that the problem is compounded if the variability in the exposure depends on the shared confounders. In many studies, some of the shared confounders are measured, which makes it possible to partly assess these mechanisms. As a concrete example, Gebremedhin et al [25] provided data on interpregnancy interval categorized into 7 categories, for Caucasian and non-Caucasian mothers separately; we have reproduced these data in Table 2. A standard 2 -test gives a p-value less than 2.2 × 10 −16 for these data; thus, ethnicity is associated with interpregnancy interval. A common measure of variation for nominal variables is the HREL index [26], which is a scaled version of the Shannon entropy. Computing this measure for the data in Table 2 gives very similar figures, 0.88 and 0.90, for the Caucasian and non-Caucasian mothers. This indicates that, although 'ethnicity' may be an important shared confounder in the study by Gebremedhin et al [25], it is not plausibly a major source of non-generalizability for their sibling comparisons. If no major differences are found in exposure-variation for any other measured non-shared confounders, then this may taken as evidence for generalizability of the sibling comparison estimates. A similar caveat as above applies here as well though; it is possible that the exposure has similar variation across levels of all measured confounders, but varies strongly across levels of one or several unmeasured confounders. Yet another way to assess the magnitude of non-generalizability is to fit both a fixed effects model and a marginal BW model, and compare the estimates. If these are similar, then this may be taken as evidence for generalizability of the fixed effects model estimate. The converse is more questionable though since there could be several explanations for a difference between the estimates, including bias due to misspecification of the marginal BW model, violation of the BW assumption, or (e.g. for logistic models) non-collapsibility of the chosen effect measure [27]. The fit of the marginal BW model can be assessed with standard diagnostic tools, and the model can be refined until a reasonable fit is achieved. If some of the shared confounders are measured, as in the study by Gebremedhin et al [25], one can verify that that BW assumption holds with respect to these, for the particular choice of D i . However, this does not guarantee that the assumption holds with respect to the unmeasured confounders.
In summary, even though the presence and magnitude of non-generalizability can be assessed empirically, such empirical tests can only provide limited evidence. Hence, when judging whether a particular study may suffer from substantial generalizability problems it is also important to use subject matter knowledge about the situation at hand. In some situations one may have a priori reason to believe that there is no substantial effect measure modification by the shared confounders, or that the variation in the exposure is fairly constant across the shared confounders, in which case one may conclude that the sibling comparison estimates are likely to generalize well. However, if such subject matter knowledge is lacking, and the empirical tests discussed above are either not applicable or deemed unreliable, then one should be rather cautious to generalize the results from the sibling comparison study to the whole population.

Discussion
The sibling comparison design is an important component in the epidemiologic toolbox. However, it has subtle features that are not present in simpler designs, which must be properly understood in order to interpret the results correctly. We have shown how the selection of covariate-discordant families in sibling comparison studies may affect the generalizability of the results, and that a similar generalizability problem may arise even if all families are covariate-discordant (e.g. if the exposure is continuous) if there is effect measure modification by the shared familial confounders. We have demonstrated that the problem can be solved by using a marginal BW model to estimate the marginal exposure effect.
When the exposure effect varies across levels of confounders (or other covariates), stratum-specific effects are often of greater public health interest than the marginal effect, since they can be used to answer more detailed and relevant questions regarding specific subpopulations (e.g. patients with certain characteristics). However, as noted in Section 'Estimation of marginal effects with marginal between-within models' it is typically not possible to stratify on all the shared confounders in sibling comparison studies, since these are unmeasured to a large extent. Thus, it seems like the best one could hope for is an approximately unbiased estimate of an effect that is representative for the aggregated population, e.g. a marginal exposure effect.
In our simulation, we used a continuous exposure. Marginal BW models can be used with binary exposures as well since the underlying theory for the model, as described briefly in this paper and more thoroughly by Sjölander [14], makes no assumption about whether the exposure is binary or continuous. However, when the exposure is binary the marginal BW model tends to make strong extrapolation outside the observed data, and may thus be sensitive to the parametric model assumptions. Essentially, this is because when the exposure is binary the value of the proxy variable D i typically restricts the set of possible exposure values for each individual sibling in the family. For instance, suppose that D i is taken to be the exposure mean X i , and observed to be equal to 1 in a particular family. Then, the individual exposure value X ij must also be equal to 1 for all siblings in that family. Thus, when predicting the counterfactual outcome for an individual j in that family, had X ij been hypothetically set to 0, the model has to rely completely on information from other families with X i ≠ 1 . We provide a more detailed discussion of marginal BW models for binary exposures in the Appendix.

A Conditional maximum likelihood under the fixed effects model (1)
When using conditional maximum likelihood estimation of X in model (1), the conditional conditional distribution of Y ij , given (U i , X ij ) , is assumed to be on the form where the canonical parameter ij is equal to 0i + X X ij . It follows that where the first equality follows from the fact that Fig. 1. Inference is conditioned on n i Y i , which is a sufficient statistic for 0i . The conditional maximum likelihood contribution for set i becomes equal to For an exposure-concordant sibling set with X i1 = … X in i , the conditional likelihood contribution for set i simplifies to This does not depend on X , hence the exposure-concordant sibling sets are non-informative.

B Asymptotic limit of X under effect measure modification by U i
If g is the identity link, then Y ij is assumed to have a normal distribution with mean ij = 0i + X X ij and constant variance 2 , given (U i , X ij ) , so that n i Y i has a normal distribution with mean n i i and variance n i 2 , given (U i , i ) . It follows that Hence, the conditional maximum likelihood score contribution from set i is If g is the log link, then Y ij is assumed to have a Poisson distribution with mean ij = exp( 0i + X X ij ) , given (U i , X ij ) , so that n i Y i has a Poisson distribution with mean n i i , given (U i , i ) . It follows that Hence, the conditional maximum likelihood score contribution from set i is When n i = 2 and X ij is binary, this simplifies to If g is the logit link, then Y ij is assumed to have a Bernoulli distribution with mean ij = expit( 0i + X X ij ) , given (U i , X ij ) , so that which is the usual conditional logistic regression likelihood. Hence, the conditional maximum likelihood score contribution from set i is When n i = 2 and X ij is binary, this simplifies to It follows from standard theory for estimating equations that ̂ X converges to the solution to the equation For the score contribution in (9) we have, under model (3), that so that (10) It follows that ̂ X converges to the expression in (4). For the score contribution in (10) we have, under model (3), that so that It follows that where We have that w(U i ) increases with Var(X ij |U i ) , and that For the score contribution in (11) we have, under model (3), that Again, we have that w(U i ) increases with Var(X ij |U i ) , and that Var(X ij |U i ) = 0 ⇒ w(U i ) = 0.
Consider now the situation where Y ij is assumed to have a normal distribution with mean ij = 0i + X X ij + C C ij and constant variance 2 , given (U i , X ij , C ij ) , for a scalar covariate C ij . Arguing as above, the conditional maximum likelihood score contribution from set i is Suppose now t hat t he tr ue mean of Y ij i s It follows, after some algebra, that where As before we have that we have that w(U i ) increases with Var(X ij |U i ) . Furthermore, it follows from the Cauchy-S c h w a r z i n e q u a l i t y t h a t Cov(

C Special cases when the BW assumption is guaranteed to hold
In the absence of measured non-shared confounders we have that Thus, p(U i | i ) only depends on i through the term

D Correct marginal BW model implied by the conditional model (7)
Under model (7) we have that and (u; , 2 ) being the normal density function with mean and variance 2 . Since U i only depends on i through D i , it follows that D i satisfies the BW assumption (5); see Sjölander [14], equation 2.2.

E Marginal BW models for binary exposures
In this section we provide further discussion of the marginal BW model with a binary exposure. As a motivating example, suppose that n i = 2 for all i, is the potential outcome [8] for sibling j in family i, if the exposure were set to level x. The observed data distribution defines the four parameters ( 0,0 , 0,0.5 , 1,0.5 , 1,1 ) . However, the two parameters ( 0,1 , 1,0 ) are not well defined since X i = 1 is logically incompatible with X ij = 0 , and X i = 0 is logically incompatible with X ij = 1 . This is analogous to violations of the positivity assumption in the context of inverse probability weighting [28].
For this example, the BW assumption (5) trivially holds, since i is a constant (up to the arbitrary ordering of X i1 and X i2 ) conditional on D i = X i , thus conditionally independent of U i , given D i . We further have that ̃ x,X i and x,X i are related through It follows from results by Sjölander [14] (Supplementary material, Section 3) that, in order for regression standardization to give consistent estimates of marginal causal effects, the regression function h(X i , X ij ; ) in (6) must be a correct model for ̃ x,X i when the observed exposure level X ij is replaced in the regression function by the fixed constant x. Notably, this requirement must be met even when the observed X i is logically incompatible with an observed value X ij = x . This makes clear that regression standardization makes a strong extrapolation when estimating (̃ 0,1 ,̃ 1,0 ) , since these are not determined by the observed data distribution. For instance, suppose that we consider two different parametric models h(X i , X ij ; ) , say M 1 and M 2 . Suppose that neither M 1 nor M 2 imply any restrictions on ( 0,0 , 0,0.5 , 1,0.5 , 1,1 ) . Then, both M 1 and M 2 fit the observed data equally well, so that the observed data cannot be used to distinguish between the models. Furthermore, the (ML) estimates of ( 0,0 , 0,0.5 , 1,0.5 , 1,1 ) are equal to the corresponding sample means for both M 1 and M 2 , so that M 1 and M 2 give the same estimates of (̃ 0,0 ,̃ 0,0.5 ,̃ 1,0.5 ,̃ 1,1 ) through the relations in (13). However, M 1 and M 2 may still give different estimates of (̃ 0,1 ,̃ 1,0 ) , when h(X i , x; ) is interpreted as a model for ̃ x,X i . As a concrete example, suppose that we wish to estimate the marginal mean difference and consider the following two models: and It is easy to see that neither of these models imply any restrictions on ( 0,0 , 0,0.5 , 1,0.5 , 1,1 ) . However, by interpreting h(X i , x; ) as a model for ̃ x,X i we have, for M 1 , that and which is the mean difference among the exposure-discordant pairs. In contrast, for M 2 we have that and Thus, the obtained estimate of depends heavily on whether we use model M 1 or model M 2 .